
Member

@xal-0 xal-0 commented Aug 7, 2025

Revived version of #48244, with a slightly different approach. This version looks for a function pointer called jl_image_unpack inside compiled system images and invokes it to get the jl_image_buf_t struct. Two implementations, jl_image_unpack_zstd and jl_image_unpack_uncomp, are provided (for comparison). The zstd compression is applied only to the heap image, not to the compiled code, since the code can be shared across Julia processes.
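A minimal sketch of the dispatch idea described above, not the PR's actual code: an image carries a pointer to its own unpacker, and the loader simply calls it. Every name here (`sketch_image_buf_t`, `sketch_image_t`, `sketch_unpack_uncomp`, `sketch_unpack_rle`, `sketch_load_image`) is hypothetical, the real lookup goes through a symbol in the compiled image rather than a struct field, and a toy run-length decoder stands in for zstd so the example has no external dependencies.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for jl_image_buf_t: only the unpacked heap bytes. */
typedef struct {
    const char *data;
    size_t size;
    int owned; /* nonzero if the unpacker allocated (or tried to allocate) data */
} sketch_image_buf_t;

typedef sketch_image_buf_t (*sketch_unpack_fn)(const char *stored, size_t stored_len);

/* Hypothetical stand-in for a compiled image: a payload plus its unpacker. */
typedef struct {
    sketch_unpack_fn unpack;
    const char *stored;
    size_t stored_len;
} sketch_image_t;

/* Uncompressed variant: hand back the stored bytes without copying. */
static sketch_image_buf_t sketch_unpack_uncomp(const char *stored, size_t stored_len)
{
    sketch_image_buf_t buf = { stored, stored_len, 0 };
    return buf;
}

/* "Compressed" variant: a toy (count, byte) run-length decoder standing in
 * for zstd. It sizes the output first, allocates it, then decodes into it. */
static sketch_image_buf_t sketch_unpack_rle(const char *stored, size_t stored_len)
{
    sketch_image_buf_t buf = { NULL, 0, 1 };
    size_t total = 0;
    for (size_t i = 0; i + 1 < stored_len; i += 2)
        total += (unsigned char)stored[i];
    char *out = (char *)malloc(total ? total : 1);
    if (out == NULL)
        return buf; /* data stays NULL to signal failure */
    size_t pos = 0;
    for (size_t i = 0; i + 1 < stored_len; i += 2) {
        size_t run = (unsigned char)stored[i];
        memset(out + pos, stored[i + 1], run);
        pos += run;
    }
    buf.data = out;
    buf.size = total;
    return buf;
}

/* The loader does not care which variant the image was built with. */
static sketch_image_buf_t sketch_load_image(const sketch_image_t *img)
{
    return img->unpack(img->stored, img->stored_len);
}
```

The point of the indirection is that the choice of compression is made when the image is built, so the loader needs no flag or format detection of its own.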

TODO: test a few different compression settings and enable by default.

Example data from un-trimmed juliac "hello world":

156M  hello-uncomp
 43M  hello-zstd
 48M  hello-zstd-1
 45M  hello-zstd-5
 43M  hello-zstd-15
 39M  hello-zstd-22

$ hyperfine -w3 ./hello-uncomp 
Benchmark 1: ./hello-uncomp
  Time (mean ± σ):      74.4 ms ±   0.8 ms    [User: 51.9 ms, System: 19.0 ms]
  Range (min … max):    73.0 ms …  76.6 ms    39 runs

$ hyperfine -w3 ./hello-zstd-1
Benchmark 1: ./hello-zstd-1
  Time (mean ± σ):     152.4 ms ±   0.5 ms    [User: 138.2 ms, System: 12.0 ms]
  Range (min … max):   151.4 ms … 153.2 ms    19 runs
 
$ hyperfine -w3 ./hello-zstd-5 
Benchmark 1: ./hello-zstd-5
  Time (mean ± σ):     154.3 ms ±   0.5 ms    [User: 139.6 ms, System: 12.4 ms]
  Range (min … max):   153.5 ms … 155.2 ms    19 runs

$ hyperfine -w3 ./hello-zstd-15
Benchmark 1: ./hello-zstd-15
  Time (mean ± σ):     135.9 ms ±   0.5 ms    [User: 121.6 ms, System: 12.0 ms]
  Range (min … max):   135.1 ms … 136.5 ms    21 runs
 
$ hyperfine -w3 ./hello-zstd-22
Benchmark 1: ./hello-zstd-22
  Time (mean ± σ):     149.0 ms ±   0.6 ms    [User: 134.7 ms, System: 12.1 ms]
  Range (min … max):   147.7 ms … 150.4 ms    19 runs

src/staticdata.c Outdated
jl_dlsym(handle, "jl_image_pointers", (void**)&image->pointers, 1);

image->size = ZSTD_getFrameContentSize(data, *plen);
image->data = (char *)malloc(image->size);
Member

We probably want to mmap this with huge pages/large pages
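A rough sketch of what that suggestion could look like on Linux, assuming an anonymous private mapping plus a transparent-huge-page hint is acceptable in place of `malloc`. The helper names `sketch_alloc_image` and `sketch_free_image` are hypothetical and are not part of the PR; other platforms have different large-page mechanisms that are not covered here.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* exposes MAP_ANONYMOUS and MADV_HUGEPAGE on glibc */
#endif
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical helper: reserve `size` bytes for the decompressed heap image
 * and ask the kernel to back the range with transparent huge pages. */
static void *sketch_alloc_image(size_t size)
{
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;
#ifdef MADV_HUGEPAGE
    /* Advisory only: if the kernel refuses, the mapping still works with
     * regular pages, so the return value is deliberately ignored. */
    (void)madvise(p, size, MADV_HUGEPAGE);
#endif
    return p;
}

/* Release a mapping obtained from sketch_alloc_image. */
static void sketch_free_image(void *p, size_t size)
{
    if (p != NULL)
        munmap(p, size);
}
```

The decompressor would then write into the returned region instead of a `malloc`ed buffer. Whether this helps startup time would need to be measured.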

@ViralBShah
Member

The savings are really nice here, but is there a way to claw back some of the startup time? Thinking out loud: is it worth checking whether Zstd supports AVX-512 (or other fancy instructions) for a speedup that may not currently be enabled?

@xal-0
Member Author

xal-0 commented Aug 13, 2025

Currently I'm testing lz4 as an alternative compression algorithm that sacrifices some compression ratio for decompression speed, since it is intended to be about as fast as RAM on modern CPUs. @vtjnash also had some ideas about doing decompression and relocation in a single pass that I'd like to try: in this version we touch a whole bunch of pages while decompressing, and then force them all back into cache later, when performing relocations.
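To make the single-pass idea concrete, here is a toy sketch under heavy assumptions. A chunked `memcpy` stands in for a streaming decompressor, and the relocation format (a sorted list of byte offsets to pointer-sized slots that hold image-relative offsets) is hypothetical and is not Julia's. The function name `sketch_copy_and_relocate` and the chunk size are also made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { SKETCH_CHUNK = 4096 }; /* hypothetical chunk size, a multiple of the slot size */

/* Produce the image chunk by chunk and patch each chunk's relocations right
 * after it is written, while those pages are still hot in cache.
 * Assumptions: `relocs` is sorted ascending, every offset is aligned to
 * sizeof(uintptr_t), and each slot lies entirely inside the image, so no slot
 * straddles a chunk boundary. */
static void sketch_copy_and_relocate(char *dst, const char *src, size_t size,
                                     const size_t *relocs, size_t nrelocs)
{
    size_t r = 0;
    for (size_t start = 0; start < size; start += SKETCH_CHUNK) {
        size_t len = size - start < SKETCH_CHUNK ? size - start : SKETCH_CHUNK;
        /* Stand-in for "decompress the next chunk into dst". */
        memcpy(dst + start, src + start, len);
        /* Apply every relocation that falls inside the chunk just produced. */
        while (r < nrelocs && relocs[r] < start + len) {
            uintptr_t slot;
            memcpy(&slot, dst + relocs[r], sizeof slot);
            slot += (uintptr_t)dst; /* image-relative offset becomes an address */
            memcpy(dst + relocs[r], &slot, sizeof slot);
            r++;
        }
    }
}
```

A two-pass version would run the copy loop to completion and only then walk `relocs`, touching every relocated page a second time after it may have left the cache.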

@JeffBezanson
Member

I tried a simple test with the command line zstd and lz4 (so may not be representative) and they took basically the same amount of time but zstd compression was much better. So much better that I suspect the time was made up by reading less data.

Relocating while decompressing sounds awesome if we can pull that off.

@gbaraldi
Member

I believe we should use lz4hc, which is quite slow to compress but has similar ratios to zstd (while decompressing about 2.5x faster).

@xal-0
Member Author

xal-0 commented Aug 14, 2025

I have arrived with some crude plots, all measured by compiling the system image (un-trimmed hello world) with 1 image thread. I've also replaced the "uncompressed" benchmark with one that copies the entire heap, to see how much of the slowdown is a result of doing two passes over every page of the heap; there seems to be no major difference.

[Plots: compression, decompression]

@gbaraldi
Member

Can we use threads for compressing/decompressing?

@xal-0
Member Author

xal-0 commented Aug 15, 2025

Experimenting with compressing on JULIA_IMAGE_THREADS now, in hopes that we can get away with using one of the higher zstd levels by default. It also might be relevant that all of the above tests were conducted with the compressed heap image in .data, when it could be in .rodata.

@gbaraldi
Member

I think they technically go in ldata currently. But not sure if rodata helps the OS in any meaningful way except write protection

@xal-0 xal-0 force-pushed the compressed-sysimg branch from 7a0de7a to 0ee175b Compare August 25, 2025 17:46
@xal-0 xal-0 requested a review from gbaraldi August 25, 2025 23:32
@@ -160,6 +160,7 @@ JL_DLLEXPORT void jl_init_options(void)
0, // task_metrics
-1, // timeout_for_safepoint_straggler_s
0, // gc_sweep_always_full
0, // compress_sysimage
Member

We should revisit this default, but fine for now

Member Author

I'm going to merge this as an MVP that will let us see how bad the startup costs are in practice. IMO we should revisit multithreaded (de)compression and lz4hc before enabling it by default.

@gbaraldi
Member

LGTM default aside

@ViralBShah
Member

Should we do a pkgeval run on this one?

@KristofferC
Member

I don't think a PkgEval run is really relevant for this PR.

@xal-0 xal-0 merged commit cbea8cf into JuliaLang:master Aug 26, 2025
7 checks passed

5 participants